Goal of this project:
Using R and exploratory data analysis techniques, I explore relationships
within the red wine quality dataset, in particular how chemical properties
influence the quality of red wine.
The RedWineQuality dataset contains 1599 red wine observations of 12 variables:
11 chemical properties plus the output variable ‘quality’ (based on sensory
data), which is scored between 0 (very bad) and 10 (very excellent). The data
frame below also carries a row index, X, as a 13th column.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Apart from the integer row index X, all variables are numeric except quality,
which is an integer.
Before exploring, I’d like to see a summary of the ‘quality’ variable:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
As shown above, wine quality ranges from 3 to 8 in this dataset. 82.5% of the
wines were scored 5 or 6.
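The 82.5% figure can be reproduced from the frequency table above. The sketch below re-enters the printed counts by hand rather than loading the dataset, so the name `quality_counts` is only illustrative.

```r
# Counts per quality rating, copied from the table() output above
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

total <- sum(quality_counts)            # total number of observations
share <- 100 * quality_counts / total   # percentage of wines per rating

# Share of wines rated 5 or 6
share_5_6 <- sum(share[c("5", "6")])
round(share_5_6, 1)
```

The same `share` vector also gives the proportion of wines at either end of the scale, which is useful when judging how unbalanced the ratings are.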
## quality_bucket
## Bad (Rating 3 - 4) Normal (Rating 5) Good (Rating 6)
## 63 681 638
## Excellent (Rating 7 - 8)
## 217
I created quality_bucket to group the quality ratings into four buckets. Wines
scoring 3 or 4 fall in the “Bad” bucket, wines scoring 5 in the “Normal” bucket,
wines scoring 6 in the “Good” bucket, and wines scoring 7 or 8 in the
“Excellent” bucket.
Let’s look at each chemical variable’s distributions:
Most distributions tend to be positively skewed. I noticed that the distributions
of fixed.acidity, density and pH are symmetric.
The box plots above clearly show outliers.
Following the usual descriptive-statistics convention, I defined outliers as
data points more than 1.5 times the interquartile range (IQR) above the upper
quartile or below the lower quartile. Since most distributions are positively
skewed, most outliers lie on the high side, so I decided to remove only the
values greater than Q3 + 1.5 × IQR.
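The upper-fence rule described above can be sketched as a small function. The helper name `remove_upper_outliers` and the toy vector are assumptions made for illustration; they are not taken from the original analysis.

```r
# Drop values above the upper Tukey fence, Q3 + 1.5 * IQR.
# `remove_upper_outliers` is a hypothetical helper, not part of the original analysis.
remove_upper_outliers <- function(x) {
  q3 <- quantile(x, 0.75, names = FALSE)
  upper_fence <- q3 + 1.5 * IQR(x)
  x[x <= upper_fence]
}

# Toy vector (made-up values) with one obvious high outlier
toy <- c(9.4, 9.8, 9.8, 10.0, 9.5, 10.5, 9.4, 14.9)
cleaned <- remove_upper_outliers(toy)
cleaned
```

For alcohol, the summary above gives Q1 = 9.50 and Q3 = 11.10, so the fence would be 11.10 + 1.5 × 1.60 = 13.50.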
Let’s see the distribution of each variable after removing outliers:
After removing the outliers, the distributions of citric.acid, free.sulfur.dioxide,
total.sulfur.dioxide and alcohol remain slightly positively skewed. But the
box plots show that most of the outliers have been removed. Overall, the
distributions are now closer to symmetric.
# Quartiles and Tukey fences (1.5 * IQR) for alcohol
alcohol_lowerq <- quantile(wine$alcohol, 0.25)
alcohol_upperq <- quantile(wine$alcohol, 0.75)
alcohol_upper <- alcohol_upperq + 1.5 * IQR(wine$alcohol)
alcohol_lower <- alcohol_lowerq - 1.5 * IQR(wine$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.10 10.39 11.00 13.40
After removing outliers, the mean of alcohol is 10.39. There is a peak around 9.3.
Let’s look at the alcohol distribution by quality rating:
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.580 11.000
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
Higher-quality wines tend to have a higher mean alcohol content, although
quality 5 (mean 9.9) sits slightly below qualities 3 and 4.
## # A tibble: 6 x 4
## quality alco_mean alco_median n
## <int> <dbl> <dbl> <int>
## 1 3 9.955000 9.925 10
## 2 4 10.265094 10.000 53
## 3 5 9.899706 9.700 681
## 4 6 10.629519 10.500 638
## 5 7 11.465913 11.500 199
## 6 8 12.094444 12.150 18
I built a summary table, ‘wine.alco_by_quality’, describing alcohol by quality
rating. Wines with a quality score of 8 have the highest mean alcohol (about
12.09%) and the highest median (about 12.15%). But these summaries alone cannot
prove a linear relationship or a correlation.
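A per-rating summary of this kind can be sketched in base R. The tibble printed above suggests the original used dplyr; the version below swaps in `split()` and `lapply()` so that it runs without extra packages, and it uses a made-up toy data frame instead of `wine`.

```r
# Toy data (made-up values); in the analysis this would be the `wine` data frame
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 10.5, 11.1, 11.5, 12.1)
)

# Mean, median and count of alcohol per quality rating
alco_by_quality <- do.call(rbind, lapply(split(toy, toy$quality), function(g) {
  data.frame(
    quality     = g$quality[1],
    alco_mean   = mean(g$alcohol),
    alco_median = median(g$alcohol),
    n           = nrow(g)
  )
}))
alco_by_quality
```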
Having looked at the mean and median of alcohol in each quality category, I’m
curious to find out whether alcohol influences the quality of wine, and whether
other variables act together with alcohol to influence quality.
There are 1599 wine observations in the dataset with 12 features
(fixed acidity, volatile acidity, citric acid, residual sugar, chlorides,
free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol and
quality). The output variable quality is based on sensory data and is scored
between 0 and 10.
I set the ‘quality’ variable as an ordered factor. Its levels are shown
below:
(very bad) --> (very excellent)
quality: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
In the dataset, however, quality only ranges between 3 and 8. I grouped quality
into four buckets: Bad (rating 3-4), Normal (rating 5), Good (rating 6) and
Excellent (rating 7-8).
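The ordered factor and the four buckets described above could be built along these lines. This is a minimal sketch on made-up ratings; the break points passed to `cut()` are my assumption about how the buckets were defined, chosen to reproduce the bucket labels printed earlier.

```r
# Toy ratings (made-up) covering the observed range 3 to 8
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Ordered factor on the full 0-10 scale, even though only 3-8 occur
quality_ord <- factor(quality, levels = 0:10, ordered = TRUE)

# Four buckets; intervals are right-closed: (2,4], (4,5], (5,6], (6,8]
quality_bucket <- cut(
  quality,
  breaks = c(2, 4, 5, 6, 8),
  labels = c("Bad (Rating 3 - 4)", "Normal (Rating 5)",
             "Good (Rating 6)", "Excellent (Rating 7 - 8)"),
  ordered_result = TRUE
)
table(quality_bucket)
```

Keeping the buckets ordered means plots and summaries list them from Bad to Excellent rather than alphabetically.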
Other observations:
The main features of interest in my dataset are quality, alcohol, density and
citric.acid. I’d like to know which feature or combination of features best
predicts the quality of wine.
I suspect that alcohol or citric.acid, together with some combination of the
other variables, influences the quality of wine. This suspicion may help me
build a predictive model for wine quality in the following analysis.
Features like pH and density will help support my investigation, because I
suspect that alcohol influences the density of wine, and that pH is influenced
by alcohol and citric.acid. However, I will decide the final list of features
to explore in the next section, where a correlation matrix plot will let me
choose the features of interest with more confidence.
I created a summary table named ‘wine.alco_by_quality’ to better see whether
there’s a relationship between alcohol and quality.
Also, I created quality_bucket to group the quality ratings into four buckets:
“Bad” for wines scoring 3 or 4, “Normal” for wines scoring 5, “Good” for wines
scoring 6, and “Excellent” for wines scoring 7 or 8.
I found some outliers in the dataset and drew box plots for each variable to
see where they lie. For example, I found some outliers in the alcohol variable
(below 9 or above 14). I also noticed that the best quality category (rating 8)
has the highest mean (12.09) and median (12.15) of alcohol. But this alone does
not establish a linear relationship or correlation between alcohol and quality.
I will analyze them further in the following section.
The citric.acid distribution has several peaks and is slightly skewed to the
right. The highest peak is at 0.00, and there are three other, relatively small
peaks in the distribution. I also noticed an outlier at 1.00. I checked the
wine with 1.00 citric.acid and found that it has a quality of 4.
Instead of plotting potential relationships among these variables at random,
I used a correlation matrix plot to find meaningful correlations between variables.
I’m interested in the parts of the color scale that extend from -0.2 to -1 or
from 0.2 to 1. These ranges represent medium or strong correlations between the
given variables.
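The same screening can be done numerically: compute the Pearson correlation matrix and keep only pairs with |r| of at least 0.2. The sketch below uses synthetic data with made-up column names `a`, `b` and `c`; with the real data, the numeric columns of `wine` would take their place.

```r
# Synthetic data (not the wine dataset): b is built to correlate with a
set.seed(1)
n <- 200
x <- rnorm(n)
toy <- data.frame(
  a = x,
  b = -0.7 * x + rnorm(n),   # negatively related to a
  c = rnorm(n)               # independent noise
)

corr <- cor(toy)   # Pearson correlation matrix

# Keep each pair once (upper triangle) and flag |r| >= 0.2
idx <- which(upper.tri(corr) & abs(corr) >= 0.2, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(corr)[idx[, "row"]],
  var2 = colnames(corr)[idx[, "col"]],
  r    = corr[idx]
)
strong_pairs
```

The 0.2 cut-off mirrors the color-scale threshold mentioned above; it is a screening choice, not a significance test.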
First, I noticed that four variables have meaningful correlations with quality:
alcohol, sulphates, citric.acid and volatile.acidity. Second, I found that the
correlations between density and fixed.acidity, and between pH and fixed.acidity,
are moderate. I’m also interested in exploring the relationship between alcohol
and density.
To sum up, I’m going to explore the relationships between alcohol and quality,
citric.acid and quality, density and fixed.acidity, and alcohol and density.
Furthermore, I assume that more than one variable influences the quality of red
wines. Therefore, I will test the correlation between quality and different
combinations of alcohol, sulphates, citric.acid and volatile.acidity.
In addition, I noticed two strong but uninformative correlations: citric.acid
with fixed.acidity, and total.sulfur.dioxide with free.sulfur.dioxide. These
correlations are strong because one variable is a subset of the other.
I removed the outliers in alcohol to see whether the relationship between
alcohol and quality would be stronger. It turned out to be only slightly
stronger, so it’s better to test these two variables with Pearson’s
correlation. There may also be more variables involved in this relationship.
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
The Pearson’s correlation result above shows a moderate positive correlation
between alcohol and quality. To be more specific, wines with higher alcohol
tend to be of better quality.
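As a check on how to read this output, the t statistic and the confidence interval can be recomputed from the reported correlation and sample size alone. The sketch below uses the standard formulas for Pearson’s test (a t statistic with n - 2 degrees of freedom and a Fisher z interval); the variable names are illustrative.

```r
r  <- 0.4761663   # sample correlation reported above
n  <- 1599        # number of observations
df <- n - 2       # degrees of freedom, 1597 as in the output

# t statistic for H0: true correlation = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

With a sample this large, even a moderate correlation gives a very large t statistic, which is why the p-value is so small; the size of the effect is better judged from the correlation itself.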
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
There’s a moderate negative correlation between alcohol and density. To be
specific, wines with higher alcohol tend to have lower density.
##
## Pearson's product-moment correlation
##
## data: wine$citric.acid and wine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
There’s a small positive correlation between citric.acid and quality.
## quality_bucket: Bad (Rating 3 - 4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0200 0.0800 0.1737 0.2700 1.0000
## --------------------------------------------------------
## quality_bucket: Normal (Rating 5)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## quality_bucket: Good (Rating 6)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## quality_bucket: Excellent (Rating 7 - 8)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3000 0.4000 0.3765 0.4900 0.7600
Better-quality wines have a higher mean citric.acid. Excellent wines have the
highest citric.acid mean (about 0.3765 g / dm^3) and the highest citric.acid
median (about 0.4 g / dm^3).
Although citric.acid can add ‘freshness’ or flavor to wine, its correlation
with quality is weak. Still, there’s a tendency for better-quality wines to
have a higher mean citric.acid.
##
## Pearson's product-moment correlation
##
## data: wine$fixed.acidity and wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
fixed.acidity and density have a moderate positive correlation.
I used a correlation matrix plot to find moderate or strong correlations
between the variables. Four variables have meaningful correlations with quality:
alcohol, sulphates, citric.acid and volatile.acidity. I explored the correlations
between alcohol and quality, citric.acid and quality, density and fixed.acidity,
and alcohol and density.
I found that alcohol and quality have a moderate correlation: wines with
higher alcohol tend to be of better quality. The correlation is around 0.476.
Fixed.acidity and density have a moderate positive correlation.
The correlation between quality and citric.acid is weak. But I found that
better-quality wines have a higher mean citric.acid. For example, Excellent
wines have the highest citric.acid mean (about 0.3765 g / dm^3) and the highest
citric.acid median (about 0.4 g / dm^3).
There’s a moderate negative correlation between alcohol and density. To be
specific, wines with higher alcohol tend to have lower density. The correlation
is around -0.496.
Of the pairs I tested, fixed.acidity and density have the strongest correlation,
around 0.67.
The four quality groups follow the relationship between density and
alcohol.
The quality groups follow the relationship between pH and citric.acid. The low
quality group has a relatively wider range of citric.acid. I also noticed that
many medium quality wines have 0 citric.acid, compared to the low and high
quality groups.
The quality groups all follow the positive correlation between fixed.acidity and
density. As fixed.acidity increases, the density of wine tends to increase.
By calculating the r-squared value, I want to test whether alcohol, the variable
most strongly correlated with quality, has an r-squared value high enough to
support a linear relationship with quality.
# Simple linear regression of quality on alcohol
m1 <- lm(quality ~ alcohol, data = wine)
summary(m1)$r.squared
## [1] 0.2267344
I chose alcohol (which has the strongest correlation with quality among the
variables) to test the linear relationship with quality. Unfortunately, the
r-squared is not strong (0.22673).
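This value follows directly from the Pearson correlation reported earlier: for a linear model with a single predictor and an intercept, r-squared equals the square of the correlation between the two variables. The sketch below checks this against the reported numbers and then on made-up toy data.

```r
r <- 0.4761663   # Pearson correlation of alcohol and quality, from cor.test
r^2              # approximately 0.2267, the r-squared of m1

# The same identity on toy data (made-up values, not the wine dataset)
set.seed(42)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
fit <- lm(y ~ x)
all.equal(summary(fit)$r.squared, cor(x, y)^2)
```

So a correlation of about 0.48 can never give an r-squared above roughly 0.23 in a one-variable model; an r-squared of 0.5 would need a correlation of about 0.71.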
I decided to add more variables to this model to see whether the r-squared value
would improve. If it increases, the added variables together with alcohol
explain more of the variation in wine quality than alcohol alone.
From the earlier correlation matrix plot, I noticed that besides alcohol, three
variables have meaningful correlations with quality: sulphates, citric.acid and
volatile.acidity. The models below bring in variables step by step; note that
m3 and m4 use density in place of sulphates.
# m1 is the baseline; m2 adds sulphates
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- lm(quality ~ alcohol + sulphates, data = wine)
# m3 and m4 use density instead of sulphates, so they do not extend m2
m3 <- lm(quality ~ alcohol + density + citric.acid, data = wine)
m4 <- lm(quality ~ alcohol + density + citric.acid + volatile.acidity, data = wine)
summary(m1)$r.squared
## [1] 0.2267344
summary(m2)$r.squared
## [1] 0.2698912
summary(m3)$r.squared
## [1] 0.2576685
summary(m4)$r.squared
## [1] 0.3189737
As I added the variables of interest, the r-squared value improved from 0.227
(m1) to 0.319 (m4). The rise is not monotonic: m3 (0.258) scores below m2
(0.270) because m3 swaps sulphates for density rather than extending m2. The
improvement suggests that these variables together explain more of the variation
in quality than alcohol alone. But the final r-squared value is still not strong
(I consider an r-squared value over 0.5 as strong enough).
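When each model keeps every term of the previous one (nested models), r-squared can only stay the same or rise, even if the added variable is noise. Adjusted r-squared, which `summary()` also reports, penalizes extra terms. The sketch below illustrates this on synthetic data with made-up predictors `x1`, `x2` and `x3`; it is not a re-fit of the wine models.

```r
# Synthetic data (not the wine dataset): x3 has no effect on y
set.seed(7)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 0.6 * toy$x1 + 0.3 * toy$x2 + rnorm(n)

# Nested formulas: each model keeps every term of the previous one
formulas <- list(y ~ x1, y ~ x1 + x2, y ~ x1 + x2 + x3)
fits <- lapply(formulas, lm, data = toy)

r2     <- sapply(fits, function(f) summary(f)$r.squared)
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)

data.frame(model = c("m1", "m2", "m3"), r2 = r2, adj_r2 = adj_r2)
```

Comparing the two columns shows whether a new variable adds explanatory power beyond what any extra term would add by chance.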
From the bivariate analysis I found that density and alcohol have a moderate
negative correlation. From the multivariate analysis, after adding quality
groups to the plot, I found that the quality groups follow the relationship
between density and alcohol.
The quality groups all follow the positive correlation between fixed.acidity and
density. As fixed.acidity increases, the density of wine tends to increase.
Among my variables of interest, alcohol has the strongest relationship with
quality, so I calculated its r-squared value. Although the r-squared value is
not strong (around 0.22673), it improved from 0.22673 to 0.32 as I added
sulphates, density, citric.acid and volatile.acidity in successive models.
Based on the Pearson correlation value, I expected the r-squared value between
alcohol and quality to be strong, at least bigger than 0.5. That expectation
turned out to be wrong: with a single predictor, r-squared is the square of the
correlation, so 0.476 gives only about 0.227. It did surprise me that the
r-squared value rose from 0.227 to 0.32 as I added more variables to the model.
It also surprised me that the quality groups all follow the meaningful
relationships I found in the bivariate analysis. To be specific, the quality
groups follow the relationships between alcohol and density, fixed.acidity and
density, and pH and citric.acid.
I created a linear model. I used quality as the dependent variable and alcohol
as the independent variable. After finding that the r-squared value was not
strong enough, I added sulphates, density, citric.acid and volatile.acidity as
independent variables in successive models. The resulting r-squared value
improved to 0.32, but is still not strong enough.
A strength of this approach is that it reports an r-squared value each time a
new variable is added, so it’s easy to see how much of the variation in quality
the model explains. But the model has limitations. Since I didn’t test all the
variables in the dataset, there may still be variables that influence the
quality of wines that I didn’t include in the model.
Alcohol and density have a moderate negative correlation of around -0.496. Wines
with a higher alcohol percentage by volume tend to have lower density (g / cm^3).
All wine quality groups follow this relationship between alcohol and density.
Alcohol has the strongest correlation with quality, around 0.476. Wines with a
higher alcohol percentage by volume tend to be of better quality. But I did
notice that wines with a quality score of 5 are slightly out of line. This might
be because there are other variables (acting together with alcohol to influence
quality) that I didn’t discuss.
Fixed.acidity (tartaric acid - g / dm^3) and density (g / cm^3) have a moderate
positive correlation of around 0.67. Wines with higher fixed.acidity tend to have
higher density. All wine quality groups follow this relationship between
fixed.acidity and density.
This Red Wine Quality dataset contains 1,599 observations of red wines. There
are 12 variables in the dataset: 11 chemical properties of these wines and 1
output variable, wine quality, which is graded by experts between 0 (very bad)
and 10 (very excellent).
I’m interested in exploring how these chemical properties influence the quality
of wine. I plotted a correlation matrix to decide which variables to explore
further. Through univariate, bivariate and multivariate analysis and statistical
tests, I examined different meaningful relationships between these variables.
Among the variables included in the dataset, alcohol had the strongest correlation
with wine quality. The correlation is around 0.476. Wines with a higher alcohol
percentage by volume tend to be of better quality. Unfortunately, the calculated
r-squared value between alcohol and quality is not strong (around 0.22673).
I decided to add other correlated variables that I found from the correlation
matrix plot. I added sulphates, density, citric.acid and volatile.acidity in
successive models. The r-squared value improved from 0.22673 to 0.32.
I think the limitations of this dataset are one of the major challenges.
Among the 1,599 observations, 82.5% of wines received a score of 5 or 6.
Around 4% of wines received a score of 3 or 4, and 13.6% of wines received a
score of 7 or 8. It would be better to have a wider variety of quality scores
in the dataset.
For future analysis, it would be interesting and meaningful to combine or
compare this dataset with the white wine dataset, to see how the correlations
between these chemical properties and quality change.